By: Suraj Honkamble
import pandas as pd
import numpy as np
import seaborn as sns
sns.set_style('darkgrid')
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')
# Raw string so the backslashes in the Windows path are not read as escape sequences
df=pd.read_csv(r'D:\DATA SCIENCE Internship with CodersCave\Data\imdb_top_1000.csv')
df.head()
| | Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://m.media-amazon.com/images/M/MV5BMDFkYT... | The Shawshank Redemption | 1994 | A | 142 min | Drama | 9.3 | Two imprisoned men bond over a number of years... | 80.0 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28,341,469 |
| 1 | https://m.media-amazon.com/images/M/MV5BM2MyNj... | The Godfather | 1972 | A | 175 min | Crime, Drama | 9.2 | An organized crime dynasty's aging patriarch t... | 100.0 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134,966,411 |
| 2 | https://m.media-amazon.com/images/M/MV5BMTMxNT... | The Dark Knight | 2008 | UA | 152 min | Action, Crime, Drama | 9.0 | When the menace known as the Joker wreaks havo... | 84.0 | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | 2303232 | 534,858,444 |
| 3 | https://m.media-amazon.com/images/M/MV5BMWMwMG... | The Godfather: Part II | 1974 | A | 202 min | Crime, Drama | 9.0 | The early life and career of Vito Corleone in ... | 90.0 | Francis Ford Coppola | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton | 1129952 | 57,300,000 |
| 4 | https://m.media-amazon.com/images/M/MV5BMWU4N2... | 12 Angry Men | 1957 | U | 96 min | Crime, Drama | 9.0 | A jury holdout attempts to prevent a miscarria... | 96.0 | Sidney Lumet | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler | 689845 | 4,360,000 |
df.drop(['Poster_Link', 'Overview'], axis=1,inplace=True)
df.columns
Index(['Series_Title', 'Released_Year', 'Certificate', 'Runtime', 'Genre',
'IMDB_Rating', 'Meta_score', 'Director', 'Star1', 'Star2', 'Star3',
'Star4', 'No_of_Votes', 'Gross'],
dtype='object')
df.shape
(1000, 14)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Series_Title   1000 non-null   object
 1   Released_Year  1000 non-null   object
 2   Certificate    899 non-null    object
 3   Runtime        1000 non-null   object
 4   Genre          1000 non-null   object
 5   IMDB_Rating    1000 non-null   float64
 6   Meta_score     843 non-null    float64
 7   Director       1000 non-null   object
 8   Star1          1000 non-null   object
 9   Star2          1000 non-null   object
 10  Star3          1000 non-null   object
 11  Star4          1000 non-null   object
 12  No_of_Votes    1000 non-null   int64
 13  Gross          831 non-null    object
dtypes: float64(2), int64(1), object(11)
memory usage: 109.5+ KB
df.isna().sum()
Series_Title       0
Released_Year      0
Certificate      101
Runtime            0
Genre              0
IMDB_Rating        0
Meta_score       157
Director           0
Star1              0
Star2              0
Star3              0
Star4              0
No_of_Votes        0
Gross            169
dtype: int64
df['Released_Year'].unique()
array(['1994', '1972', '2008', '1974', '1957', '2003', '1993', '2010',
'1999', '2001', '1966', '2002', '1990', '1980', '1975', '2020',
'2019', '2014', '1998', '1997', '1995', '1991', '1977', '1962',
'1954', '1946', '2011', '2006', '2000', '1988', '1985', '1968',
'1960', '1942', '1936', '1931', '2018', '2017', '2016', '2012',
'2009', '2007', '1984', '1981', '1979', '1971', '1963', '1964',
'1950', '1940', '2013', '2005', '2004', '1992', '1987', '1986',
'1983', '1976', '1973', '1965', '1959', '1958', '1952', '1948',
'1944', '1941', '1927', '1921', '2015', '1996', '1989', '1978',
'1961', '1955', '1953', '1925', '1924', '1982', '1967', '1951',
'1949', '1939', '1937', '1934', '1928', '1926', '1920', '1970',
'1969', '1956', '1947', '1945', '1930', '1938', '1935', '1933',
'1932', '1922', '1943', 'PG'], dtype=object)
df[df['Released_Year']=='PG']
| | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 966 | Apollo 13 | PG | U | 140 min | Adventure, Drama, History | 7.6 | 77.0 | Ron Howard | Tom Hanks | Bill Paxton | Kevin Bacon | Gary Sinise | 269197 | 173,837,933 |
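# Apollo 13 was released in 1995; the 'PG' here is a certificate that slipped into the year column
df['Released_Year']=df['Released_Year'].replace({"PG":"1995"})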
df['Released_Year']=df['Released_Year'].astype(int)
df['Released_Year'].dtype
dtype('int32')
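The stray 'PG' value above was found by reading through `unique()`. As a minimal sketch, assuming a small toy frame in place of the real data, the same kind of bad value could also be located programmatically with `pd.to_numeric(..., errors="coerce")`. The titles and the patched year below are illustrative only.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "Series_Title": ["Movie A", "Movie B", "Movie C"],
    "Released_Year": ["1994", "PG", "2008"],
})

# errors="coerce" turns anything that is not a number into NaN
# instead of raising, so the offending rows can be inspected first.
year_num = pd.to_numeric(toy["Released_Year"], errors="coerce")
bad_rows = toy[year_num.isna()]
print(bad_rows)

# After looking up the correct year by hand, patch the bad rows and convert.
toy.loc[year_num.isna(), "Released_Year"] = "1995"
toy["Released_Year"] = toy["Released_Year"].astype(int)
print(toy["Released_Year"].tolist())
```

This keeps the manual correction explicit while making the search for bad values repeatable.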
df['Runtime'].dtype
dtype('O')
df['Runtime']=df['Runtime'].apply(lambda x:x.replace(" min",""))
df['Runtime']=df['Runtime'].astype(int)
df['Runtime'].dtype
dtype('int32')
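The `replace(" min","")` step works as long as every value ends in exactly " min". A slightly more defensive sketch, on toy values rather than the real column, extracts the leading digits with a regular expression instead:

```python
import pandas as pd

# Toy runtimes in the same "<number> min" format as the dataset.
runtime = pd.Series(["142 min", "175 min", "96 min"])

# Pull out the first run of digits; expand=False returns a Series.
minutes = runtime.str.extract(r"(\d+)", expand=False).astype(int)
print(minutes.tolist())
```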
df['Certificate'].isna().sum()
101
df['Certificate'].unique()
array(['A', 'UA', 'U', 'PG-13', 'R', nan, 'PG', 'G', 'Passed', 'TV-14',
'16', 'TV-MA', 'Unrated', 'GP', 'Approved', 'TV-PG', 'U/A'],
dtype=object)
Source : https://www.imdb.com/
From this source I found that the ratings for these movies are missing; there the Content Rating is shown as "Unrated". So it is better to impute the missing Certificate values as Unrated.
df['Certificate'].fillna('Unrated', axis=0, inplace=True)
df['Certificate'].isna().sum()
0
df['Meta_score'].mode()
0    76.0
dtype: float64
df['Meta_score'].fillna(76.0, axis=0, inplace=True)
df['Meta_score'].isna().sum()
0
df['Meta_score'].unique()
array([ 80., 100., 84., 90., 96., 94., 74., 66., 92., 82., 87.,
73., 83., 76., 79., 91., 61., 59., 65., 85., 98., 89.,
88., 57., 67., 62., 77., 64., 75., 97., 99., 78., 68.,
81., 95., 69., 55., 70., 58., 86., 71., 63., 93., 72.,
60., 47., 49., 50., 33., 54., 56., 51., 53., 48., 44.,
45., 40., 52., 28., 36., 46., 30., 41.])
df['Meta_score']=df['Meta_score'].astype(int)
df['Meta_score'].dtype
dtype('int32')
df['Gross'].isna().sum()
169
df['Gross'].fillna('0', axis=0, inplace=True)
df['Gross'].isna().sum()
0
df['Gross']=df['Gross'].apply(lambda x: x.replace(",",""))
df['Gross']=df['Gross'].astype(float)
df['Gross'].dtype
dtype('float64')
df['Gross'].plot(kind='box')
<AxesSubplot:>
There are many outliers in this column, so it is better to fill the missing values (currently stored as 0) with the median rather than the mean.
df['Gross']=df['Gross'].replace(0, df['Gross'].median())
np.log(df['Gross']).plot(kind='kde');
df['Gross']=(df['Gross']/1000000).round(3)
df['Gross'].head()
0     28.341
1    134.966
2    534.858
3     57.300
4      4.360
Name: Gross, dtype: float64
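One caveat about the median imputation above: the median is computed while the 169 placeholder zeros are still in the column, which pulls it downward. A minimal sketch of an alternative, on toy values and assuming missing figures are kept as NaN instead of 0, could look like this:

```python
import numpy as np
import pandas as pd

# Toy gross values in dollars; NaN marks a missing figure (illustrative data).
gross = pd.Series([28_341_469, np.nan, 534_858_444, np.nan, 4_360_000], dtype=float)

# Median of the observed values only (NaN is skipped by default).
observed_median = gross.median()

# For comparison: the median when missing values are first replaced by 0.
median_with_zeros = gross.fillna(0).median()

# Impute with the median of the observed values.
filled = gross.fillna(observed_median)
print(observed_median, median_with_zeros)
```

On the toy data the two medians differ noticeably, which is the effect to watch for on the real column.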
df.head()
| | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | A | 142 | Drama | 9.3 | 80 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28.341 |
| 1 | The Godfather | 1972 | A | 175 | Crime, Drama | 9.2 | 100 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134.966 |
| 2 | The Dark Knight | 2008 | UA | 152 | Action, Crime, Drama | 9.0 | 84 | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | 2303232 | 534.858 |
| 3 | The Godfather: Part II | 1974 | A | 202 | Crime, Drama | 9.0 | 90 | Francis Ford Coppola | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton | 1129952 | 57.300 |
| 4 | 12 Angry Men | 1957 | U | 96 | Crime, Drama | 9.0 | 96 | Sidney Lumet | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler | 689845 | 4.360 |
df.shape
(1000, 14)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Series_Title   1000 non-null   object
 1   Released_Year  1000 non-null   int32
 2   Certificate    1000 non-null   object
 3   Runtime        1000 non-null   int32
 4   Genre          1000 non-null   object
 5   IMDB_Rating    1000 non-null   float64
 6   Meta_score     1000 non-null   int32
 7   Director       1000 non-null   object
 8   Star1          1000 non-null   object
 9   Star2          1000 non-null   object
 10  Star3          1000 non-null   object
 11  Star4          1000 non-null   object
 12  No_of_Votes    1000 non-null   int64
 13  Gross          1000 non-null   float64
dtypes: float64(2), int32(3), int64(1), object(8)
memory usage: 97.8+ KB
df['Series_Title'].nunique()
999
df.duplicated(['Series_Title']).sum()
1
df[df.duplicated(['Series_Title'])]
| | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 136 | Drishyam | 2015 | UA | 163 | Crime, Drama, Mystery | 8.2 | 76 | Nishikant Kamat | Ajay Devgn | Shriya Saran | Tabu | Rajat Kapoor | 70367 | 0.739 |
df[df['Series_Title']=='Drishyam']
| | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 87 | Drishyam | 2013 | U | 160 | Crime, Drama, Thriller | 8.3 | 76 | Jeethu Joseph | Mohanlal | Meena | Asha Sharath | Ansiba | 30722 | 10.703 |
| 136 | Drishyam | 2015 | UA | 163 | Crime, Drama, Mystery | 8.2 | 76 | Nishikant Kamat | Ajay Devgn | Shriya Saran | Tabu | Rajat Kapoor | 70367 | 0.739 |
The 2013 film is the original Malayalam version and the 2015 film is its Hindi remake, so these are two different movies and not a duplicate record.
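Since two different films can share a title, a duplicate check on the title alone can raise false alarms. A small sketch, using a toy frame with made-up rows except for the two Drishyam entries shown above, checks title and release year together:

```python
import pandas as pd

# Toy frame: two films share a title but differ in release year.
toy = pd.DataFrame({
    "Series_Title": ["Drishyam", "Drishyam", "Movie X"],
    "Released_Year": [2013, 2015, 2010],
})

# Title only: the second Drishyam is flagged.
same_title = toy.duplicated(["Series_Title"]).sum()

# Title plus year: nothing is flagged, so there are no true duplicates.
same_title_and_year = toy.duplicated(["Series_Title", "Released_Year"]).sum()
print(same_title, same_title_and_year)
```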
df['Released_Year'].nunique()
99
df['Released_Year'].value_counts().sort_values(ascending=False)[:10]
2014    32
2004    31
2009    29
2013    28
2016    28
2001    27
2007    26
2006    26
2015    25
2012    24
Name: Released_Year, dtype: int64
df['Released_Year'].value_counts().sort_values(ascending=True)[:10]
1943    1
1920    1
1926    1
1937    1
1922    1
1930    1
1921    1
1936    1
1924    1
1941    2
Name: Released_Year, dtype: int64
movie_by_year=pd.DataFrame(df['Released_Year'].value_counts())
movie_by_year.reset_index(inplace=True)
movie_by_year.columns=['Released_Year','Number_of_Movies']
movie_by_year.head(5)
| | Released_Year | Number_of_Movies |
|---|---|---|
| 0 | 2014 | 32 |
| 1 | 2004 | 31 |
| 2 | 2009 | 29 |
| 3 | 2013 | 28 |
| 4 | 2016 | 28 |
fig=px.bar(data_frame=movie_by_year, x='Released_Year', y='Number_of_Movies',
color='Released_Year',text='Released_Year',
title="Number of High IMDB Rated movies released by Year")
fig.show()
From the graph we can see that no high-rated movie in this list was released in 1923 or 1929.
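The gap years can also be computed directly instead of being read off the chart. A minimal sketch on a toy series of years (not the real column):

```python
import pandas as pd

# Toy release years with one gap (1923) and one repeated year.
years = pd.Series([1920, 1921, 1922, 1924, 1925, 1922])

# Every year between the earliest and latest release, inclusive.
all_years = range(int(years.min()), int(years.max()) + 1)

# Years in the full range that never occur in the data.
missing = sorted(set(all_years) - set(years.tolist()))
print(missing)
```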
df['Certificate'].unique()
array(['A', 'UA', 'U', 'PG-13', 'R', 'Unrated', 'PG', 'G', 'Passed',
'TV-14', '16', 'TV-MA', 'GP', 'Approved', 'TV-PG', 'U/A'],
dtype=object)
plt.figure(figsize=(12,4))
sns.countplot(data=df, x='Certificate')
plt.title('Number of high rated movies by Content Type', fontsize=16, color='blue');
The U (Unrestricted) certificate has the most top-rated movies. TV-14, 16, TV-MA and U/A have the fewest.
print("Longest Duration", df['Runtime'].max())
print("Shortest Duration", df['Runtime'].min())
print("Average Duration", df['Runtime'].mean())
Longest Duration 321
Shortest Duration 45
Average Duration 122.891
plt.figure(figsize=(12,4))
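# distplot is deprecated in recent seaborn; histplot with a KDE overlay gives the same view
sns.histplot(df['Runtime'], kde=True, stat='density')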
plt.title('Distribution of Runtime or Movie Duration', fontsize=16, color='blue');
Most runtimes fall between 100 and 150 minutes. The longest runtime is 321 minutes, the shortest is 45 minutes, and the average is about 123 minutes.
df['Genre'].nunique()
202
df['Genre'].value_counts().sort_values(ascending=False)[:5].plot(kind='barh', figsize=(12,3));
The Drama genre has the most top-rated movies.
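The count above treats each combined string such as "Crime, Drama" as its own category, which is why there are 202 distinct values. As a hedged sketch on toy rows, individual genres can be counted by splitting the string and exploding the resulting lists:

```python
import pandas as pd

# Toy frame using the same comma-separated Genre format as the dataset.
toy = pd.DataFrame({
    "Series_Title": ["A", "B", "C"],
    "Genre": ["Drama", "Crime, Drama", "Action, Crime, Drama"],
})

# Split on ", " to get one list per movie, then put each genre on its own row.
genre_counts = toy["Genre"].str.split(", ").explode().value_counts()
print(genre_counts)
```

With this view a movie tagged with three genres contributes to three counts, so the totals add up to more than the number of movies.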
print("Maximum IMDb rating", df['IMDB_Rating'].max())
print("Minimum IMDb rating", df['IMDB_Rating'].min())
print("Average IMDb rating", df['IMDB_Rating'].mean())
Maximum IMDb rating 9.3
Minimum IMDb rating 7.6
Average IMDb rating 7.949300000000012
plt.figure(figsize=(12,4))
sns.countplot(data=df, x='IMDB_Rating')
plt.title('Number of high rated movies by IMDb Rating', fontsize=16, color='blue');
Most of the high-rated movies have an IMDb rating between 7.6 and 8.1.
print("Least Meta score for Movie",df['Meta_score'].min())
print("Highest Meta score for Movie",df['Meta_score'].max())
print("Average Meta score for movie", df['Meta_score'].mean())
Least Meta score for Movie 28
Highest Meta score for Movie 100
Average Meta score for movie 77.662
The higher the Meta score, the better the critics' reviews of the movie.
plt.figure(figsize=(12,4))
sns.countplot(data=df, x='Meta_score')
plt.title('Number of high rated movies by Meta Score', fontsize=16, color='blue')
plt.xticks(rotation=90);
The most common Meta score is 76. Note that this peak is inflated, because the 157 missing scores were filled with the mode (76) earlier.
print("Maximum Votes for Movie", df['No_of_Votes'].max())
print("Minimum Votes for Movie", df['No_of_Votes'].min())
Maximum Votes for Movie 2343110
Minimum Votes for Movie 25088
plt.figure(figsize=(12,4))
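# distplot is deprecated in recent seaborn; histplot with a KDE overlay gives the same view
sns.histplot(df['No_of_Votes'], kde=True, stat='density')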
plt.title('Distribution of Votes', fontsize=16, color='blue')
plt.xticks(rotation=90);
The vote counts are right-skewed: most movies have relatively few votes and a small number have very many.
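The skew seen in the plot can be quantified with `Series.skew()`, and a log transform is a common way to reduce it. A minimal sketch on toy vote counts (only the minimum and maximum are taken from the output above, the rest are made up):

```python
import numpy as np
import pandas as pd

# Toy vote counts with one very large value, mimicking a long right tail.
votes = pd.Series([25_088, 30_000, 45_000, 70_000, 120_000, 2_343_110])

# Positive skewness indicates a right-skewed distribution.
raw_skew = votes.skew()

# Taking log10 compresses the large values and reduces the skew.
log_skew = np.log10(votes).skew()
print(raw_skew, log_skew)
```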
print("Maximum Gross in Million for Movie", df['Gross'].max())
print("Minimum Gross in Million for Movie", df['Gross'].min())
Maximum Gross in Million for Movie 936.662
Minimum Gross in Million for Movie 0.001
plt.figure(figsize=(12,4))
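# distplot is deprecated in recent seaborn; histplot with a KDE overlay gives the same view
sns.histplot(df['Gross'], kde=True, stat='density')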
plt.title('Distribution of Gross in Million', fontsize=16, color='blue')
plt.xticks(rotation=90);
Most movies grossed between about $1,000 and $200M. The highest gross is about $936M.
df['Director'].nunique()
548
(df['Director'].value_counts()[:10]).plot(kind='barh', figsize=(12,4));
Alfred Hitchcock directed a total of 14 top-rated movies.
Steven Spielberg directed 13 top-rated movies.
star1=pd.DataFrame(df['Star1'].value_counts())
star1.reset_index(inplace=True)
star1.columns=['Star','no_of_movie1']
star1.head()
| | Star | no_of_movie1 |
|---|---|---|
| 0 | Tom Hanks | 12 |
| 1 | Robert De Niro | 11 |
| 2 | Al Pacino | 10 |
| 3 | Clint Eastwood | 10 |
| 4 | Humphrey Bogart | 9 |
star2=pd.DataFrame(df['Star2'].value_counts())
star2.reset_index(inplace=True)
star2.columns=['Star','no_of_movie2']
star2.head()
| | Star | no_of_movie2 |
|---|---|---|
| 0 | Emma Watson | 7 |
| 1 | Matt Damon | 5 |
| 2 | Kate Winslet | 4 |
| 3 | Ian McKellen | 4 |
| 4 | Chris Evans | 4 |
star3=pd.DataFrame(df['Star3'].value_counts())
star3.reset_index(inplace=True)
star3.columns=['Star','no_of_movie3']
star3.head()
| | Star | no_of_movie3 |
|---|---|---|
| 0 | Rupert Grint | 5 |
| 1 | Scarlett Johansson | 4 |
| 2 | Jennifer Connelly | 4 |
| 3 | Rachel McAdams | 4 |
| 4 | John Goodman | 4 |
star4=pd.DataFrame(df['Star4'].value_counts())
star4.reset_index(inplace=True)
star4.columns=['Star','no_of_movie4']
star4.head()
| | Star | no_of_movie4 |
|---|---|---|
| 0 | Michael Caine | 4 |
| 1 | Mark Ruffalo | 3 |
| 2 | Catherine Keener | 3 |
| 3 | Julianne Moore | 2 |
| 4 | Donald Sutherland | 2 |
star=pd.concat([star1,star2, star3, star4], axis=0)
star.head()
| | Star | no_of_movie1 | no_of_movie2 | no_of_movie3 | no_of_movie4 |
|---|---|---|---|---|---|
| 0 | Tom Hanks | 12.0 | NaN | NaN | NaN |
| 1 | Robert De Niro | 11.0 | NaN | NaN | NaN |
| 2 | Al Pacino | 10.0 | NaN | NaN | NaN |
| 3 | Clint Eastwood | 10.0 | NaN | NaN | NaN |
| 4 | Humphrey Bogart | 9.0 | NaN | NaN | NaN |
star.fillna(0, axis=0, inplace=True)
star['In_Movies']=star['no_of_movie1']+star['no_of_movie2']+star['no_of_movie3']+star['no_of_movie4']
star.drop(['no_of_movie1','no_of_movie2','no_of_movie3','no_of_movie4'], axis=1, inplace=True)
star.head()
| | Star | In_Movies |
|---|---|---|
| 0 | Tom Hanks | 12.0 |
| 1 | Robert De Niro | 11.0 |
| 2 | Al Pacino | 10.0 |
| 3 | Clint Eastwood | 10.0 |
| 4 | Humphrey Bogart | 9.0 |
star_in_movies=pd.pivot_table(data=star,
index='Star', aggfunc='max',
values='In_Movies').sort_values(by='In_Movies', ascending=False)[:20]
star_in_movies
| Star | In_Movies |
|---|---|
| Tom Hanks | 12.0 |
| Robert De Niro | 11.0 |
| Al Pacino | 10.0 |
| Clint Eastwood | 10.0 |
| Humphrey Bogart | 9.0 |
| Leonardo DiCaprio | 9.0 |
| James Stewart | 8.0 |
| Johnny Depp | 8.0 |
| Christian Bale | 8.0 |
| Emma Watson | 7.0 |
| Denzel Washington | 7.0 |
| Toshirô Mifune | 7.0 |
| Aamir Khan | 7.0 |
| Charles Chaplin | 6.0 |
| Daniel Radcliffe | 6.0 |
| Tom Cruise | 6.0 |
| Jake Gyllenhaal | 6.0 |
| Ethan Coen | 6.0 |
| Cary Grant | 6.0 |
| Dustin Hoffman | 5.0 |
star_in_movies.reset_index(inplace=True)
fig=px.bar(data_frame=star_in_movies, x='Star', y='In_Movies',
color='Star',text='In_Movies',
title="Number of High IMDB Rated Movies per Actor")
fig.show()
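Note that the pivot above uses `aggfunc='max'` on rows that come from four separate per-position counts, so each actor's value is the largest count in any single billing position and not the total across all four. A hedged sketch of one way to count total appearances, using a toy frame with placeholder names, melts the four star columns into one before counting:

```python
import pandas as pd

# Toy cast table with the same four star columns (names are placeholders).
toy = pd.DataFrame({
    "Star1": ["Actor A", "Actor B", "Actor A"],
    "Star2": ["Actor C", "Actor A", "Actor D"],
    "Star3": ["Actor E", "Actor F", "Actor C"],
    "Star4": ["Actor D", "Actor D", "Actor G"],
})
star_cols = ["Star1", "Star2", "Star3", "Star4"]

# Melt the four columns into one long "Star" column, then count names.
appearances = toy[star_cols].melt(value_name="Star")["Star"].value_counts()
print(appearances.head())
```

Here "Actor A" appears twice as Star1 and once as Star2, so the total is 3, whereas a per-position maximum would report 2.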
top_10_pop = df[['Star1','Star2','Star3', 'Star4']].values.tolist()[:10]
top_10_pop
[['Tim Robbins', 'Morgan Freeman', 'Bob Gunton', 'William Sadler'], ['Marlon Brando', 'Al Pacino', 'James Caan', 'Diane Keaton'], ['Christian Bale', 'Heath Ledger', 'Aaron Eckhart', 'Michael Caine'], ['Al Pacino', 'Robert De Niro', 'Robert Duvall', 'Diane Keaton'], ['Henry Fonda', 'Lee J. Cobb', 'Martin Balsam', 'John Fiedler'], ['Elijah Wood', 'Viggo Mortensen', 'Ian McKellen', 'Orlando Bloom'], ['John Travolta', 'Uma Thurman', 'Samuel L. Jackson', 'Bruce Willis'], ['Liam Neeson', 'Ralph Fiennes', 'Ben Kingsley', 'Caroline Goodall'], ['Leonardo DiCaprio', 'Joseph Gordon-Levitt', 'Elliot Page', 'Ken Watanabe'], ['Brad Pitt', 'Edward Norton', 'Meat Loaf', 'Zach Grenier']]
plt.figure(figsize=(12,6))
sns.barplot(data=df, y='Series_Title', x='Runtime', ci=None,
order=df.groupby('Series_Title').Runtime.max().sort_values(ascending=False)[:10].index)
plt.title("Top 10 Longest Duration Movies", fontsize=16, color='blue');
The longest movie is Gangs of Wasseypur, a Hindi film with a runtime of 321 minutes.
plt.figure(figsize=(12,6))
sns.barplot(data=df, y='Series_Title', x='Runtime', ci=None,
order=df.groupby('Series_Title').Runtime.max().sort_values(ascending=True)[:10].index)
plt.title("Top 10 Shortest Duration Movies", fontsize=16, color='blue');
The shortest movie is Sherlock Jr., with a runtime of 45 minutes.
avg_runtime=pd.DataFrame(df.groupby('Released_Year').Runtime.mean())
avg_runtime.reset_index(inplace=True)
avg_runtime.columns=['Released_Year','Runtime']
avg_runtime.head()
| | Released_Year | Runtime |
|---|---|---|
| 0 | 1920 | 76.0 |
| 1 | 1921 | 68.0 |
| 2 | 1922 | 94.0 |
| 3 | 1924 | 45.0 |
| 4 | 1925 | 85.0 |
fig=px.line(data_frame=avg_runtime, x='Released_Year', y='Runtime',
title="Average runtime of High IMDB Rated movies by each year")
fig.show()
From the graph, the average runtime is almost constant from the 1960s to 2020.
avg_rating=pd.DataFrame(df.groupby('Released_Year').IMDB_Rating.mean())
avg_rating.reset_index(inplace=True)
avg_rating.columns=['Released_Year','IMDB_Rating']
avg_rating.head()
| | Released_Year | IMDB_Rating |
|---|---|---|
| 0 | 1920 | 8.1 |
| 1 | 1921 | 8.3 |
| 2 | 1922 | 7.9 |
| 3 | 1924 | 8.2 |
| 4 | 1925 | 8.1 |
fig=px.line(data_frame=avg_rating, x='Released_Year', y='IMDB_Rating',
title="Average IMDB Rating by each year")
fig.show()
plt.figure(figsize=(12,6))
sns.barplot(data=df, y='Series_Title', x='IMDB_Rating', ci=None,
order=df.groupby('Series_Title').IMDB_Rating.max().sort_values(ascending=False)[:10].index)
plt.xticks(np.arange(0,10,0.5))
plt.title("Top 10 Movies by IMDB Rating", fontsize=16, color='blue');
The Shawshank Redemption is the highest rated movie with IMDB rating 9.3.
plt.figure(figsize=(12,6))
sns.barplot(data=df, y='Series_Title', x='Meta_score', ci=None,
order=df.groupby('Series_Title').Meta_score.max().sort_values(ascending=False)[:10].index)
plt.title("Top 10 Movies by Meta Score", fontsize=16, color='blue');
The Godfather, Notorious, Boyhood, Casablanca and Fanny och Alexander are the top 5 movies by Meta score.
plt.figure(figsize=(12,6))
sns.barplot(data=df, y='Series_Title', x='No_of_Votes', ci=None,
order=df.groupby('Series_Title').No_of_Votes.max().sort_values(ascending=False)[:10].index)
plt.title("Top 10 Movies by No of Votes", fontsize=16, color='blue');
The Shawshank Redemption, The Dark Knight, Inception, Fight Club and Pulp Fiction received the most votes.
plt.figure(figsize=(12,6))
sns.barplot(data=df, y='Series_Title', x='Gross', ci=None,
order=df.groupby('Series_Title').Gross.max().sort_values(ascending=False)[:10].index)
plt.title("Top 10 Movies by Total Gross Net", fontsize=16, color='blue');
fig=px.scatter(data_frame=df, x='Runtime', y='IMDB_Rating', color='Released_Year',
title="Runtime vs IMDB Rating")
fig.show()
From the scatter plot above, there appears to be no clear relationship between IMDB rating and movie duration.
num_feature=['Released_Year','Runtime','Meta_score','IMDB_Rating','No_of_Votes','Gross']
plt.figure(figsize=(12,6))
sns.heatmap(df[num_feature].corr(), cmap='Blues', annot=True);
plt.title("Correlation between Numerical Features", fontsize=18, color='blue');
The highest correlations are between Votes and IMDB Rating, and between Gross and Votes.
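Reading the strongest pairs off a heatmap can be error-prone. As a minimal sketch on toy numbers (not the real data), the pairs can be ranked directly from the correlation matrix:

```python
from itertools import combinations

import pandas as pd

# Toy numeric columns (illustrative values only).
toy = pd.DataFrame({
    "IMDB_Rating": [7.6, 7.8, 8.0, 8.5, 9.0],
    "No_of_Votes": [30_000, 90_000, 250_000, 900_000, 2_000_000],
    "Runtime": [120, 95, 140, 110, 130],
})

corr = toy.corr()

# One entry per unordered pair of columns, sorted from strongest positive down.
pairs = pd.Series(
    {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
).sort_values(ascending=False)
print(pairs)
```

Sorting by `pairs.abs()` instead would also surface strong negative correlations.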
fig=px.scatter(data_frame=df, x='No_of_Votes', y='IMDB_Rating', color='Released_Year',
title="No_of_Votes vs IMDB Rating")
fig.show()
There is a positive correlation between the number of votes and the IMDB rating: movies with more votes tend to have higher ratings.
fig=px.scatter(data_frame=df, y='Gross', x='IMDB_Rating', color='Released_Year',
title="Gross vs IMDB Rating")
fig.show()
runtime_by_genre=pd.DataFrame(df.groupby('Genre').Runtime.mean().sort_values(ascending=False)[:10])
runtime_by_genre.reset_index(inplace=True)
runtime_by_genre.columns=['Genre','Runtime']
runtime_by_genre.head()
| | Genre | Runtime |
|---|---|---|
| 0 | Adventure, Drama, Musical | 224.00 |
| 1 | Drama, History, Romance | 181.50 |
| 2 | Drama, Family, Musical | 181.00 |
| 3 | Adventure, Drama, History | 177.25 |
| 4 | Biography, Drama, War | 172.00 |
fig=px.bar(data_frame=runtime_by_genre, x='Genre', y='Runtime', color='Genre',
title="Top 10 Genre with high Average Run time", text='Runtime')
fig.show()
Movies in the Adventure, Drama, Musical genre have the highest average runtime.
runtime_by_genre=pd.DataFrame(df.groupby('Genre').Runtime.mean().sort_values(ascending=True)[:10])
runtime_by_genre.reset_index(inplace=True)
runtime_by_genre.columns=['Genre','Runtime']
runtime_by_genre.head()
| | Genre | Runtime |
|---|---|---|
| 0 | Comedy, Musical, War | 69.0 |
| 1 | Animation, Sci-Fi | 72.0 |
| 2 | Fantasy, Horror, Mystery | 76.0 |
| 3 | Animation, Comedy, Fantasy | 81.0 |
| 4 | Animation, Crime, Mystery | 81.0 |
fig=px.bar(data_frame=runtime_by_genre, x='Genre', y='Runtime', color='Genre',
title="Top 10 Genre with least Average Run time", text='Runtime')
fig.show()
The Comedy, Musical, War genre has the lowest average runtime.
rating_by_genre=pd.DataFrame(df.groupby('Genre').IMDB_Rating.mean().sort_values(ascending=False)[:10])
rating_by_genre.reset_index(inplace=True)
rating_by_genre.columns=['Genre','IMDB_Rating']
rating_by_genre.head()
| | Genre | IMDB_Rating |
|---|---|---|
| 0 | Animation, Drama, War | 8.50 |
| 1 | Drama, Musical | 8.40 |
| 2 | Action, Sci-Fi | 8.40 |
| 3 | Drama, Mystery, War | 8.35 |
| 4 | Western | 8.35 |
fig=px.bar(data_frame=rating_by_genre, x='Genre', y='IMDB_Rating', color='Genre',
title="Top 10 Genre with highest IMDB Rating", text='IMDB_Rating')
fig.show()
Movies in the Animation, Drama, War genre have the highest average rating.
rating_by_genre=pd.DataFrame(df.groupby('Genre').IMDB_Rating.mean().sort_values(ascending=True)[:10])
rating_by_genre.reset_index(inplace=True)
rating_by_genre.columns=['Genre','IMDB_Rating']
rating_by_genre.head()
| | Genre | IMDB_Rating |
|---|---|---|
| 0 | Animation, Drama, Romance | 7.6 |
| 1 | Adventure, Comedy, War | 7.6 |
| 2 | Drama, War, Western | 7.6 |
| 3 | Animation, Comedy, Crime | 7.6 |
| 4 | Action, Adventure, Mystery | 7.6 |
fig=px.bar(data_frame=rating_by_genre, x='Genre', y='IMDB_Rating', color='Genre',
title="Top 10 Genre with least IMDB Rating", text='IMDB_Rating')
fig.show()
metascore_by_genre=pd.DataFrame(df.groupby('Genre').Meta_score.mean().sort_values(ascending=False)[:10])
metascore_by_genre.reset_index(inplace=True)
metascore_by_genre.columns=['Genre','Meta_score']
metascore_by_genre.head()
| | Genre | Meta_score |
|---|---|---|
| 0 | Mystery, Romance, Thriller | 100.0 |
| 1 | Comedy, Musical, Romance | 99.0 |
| 2 | Adventure, Mystery, Thriller | 98.0 |
| 3 | Drama, Fantasy, War | 98.0 |
| 4 | Comedy, Music, Romance | 98.0 |
fig=px.bar(data_frame=metascore_by_genre, x='Genre', y='Meta_score', color='Genre',
title="Top 10 Genre with High Average Meta Score", text='Meta_score')
fig.show()
Movies in the Mystery, Romance, Thriller genre have an average Meta score of 100.
votes_by_genre=pd.DataFrame(df.groupby('Genre').No_of_Votes.mean().sort_values(ascending=False)[:10])
votes_by_genre.reset_index(inplace=True)
votes_by_genre.columns=['Genre','No_of_Votes']
votes_by_genre.head()
| | Genre | No_of_Votes |
|---|---|---|
| 0 | Action, Sci-Fi | 1.157242e+06 |
| 1 | Action, Adventure, Fantasy | 9.547677e+05 |
| 2 | Action, Adventure | 9.255334e+05 |
| 3 | Adventure, Drama, Sci-Fi | 9.125223e+05 |
| 4 | Action, Drama, Sci-Fi | 8.403165e+05 |
fig=px.bar(data_frame=votes_by_genre, x='Genre', y='No_of_Votes', color='Genre',
title="Top 10 Genre with High Average Votes Received", text='No_of_Votes')
fig.show()
Movies in the Action, Sci-Fi genre received nearly 1.2M votes on average.
gross_by_genre=pd.DataFrame(df.groupby('Genre').Gross.mean().sort_values(ascending=False)[:10])
gross_by_genre.reset_index(inplace=True)
gross_by_genre.columns=['Genre','Gross']
gross_by_genre.head()
| | Genre | Gross |
|---|---|---|
| 0 | Family, Sci-Fi | 435.111000 |
| 1 | Action, Adventure, Fantasy | 352.723500 |
| 2 | Action, Adventure, Family | 301.959000 |
| 3 | Action, Adventure, Sci-Fi | 280.888524 |
| 4 | Adventure, Fantasy | 280.685500 |
fig=px.bar(data_frame=gross_by_genre, x='Genre', y='Gross', color='Genre',
title="Top 10 Genre with High average Grossing", text='Gross')
fig.show()
Movies in the Family, Sci-Fi genre have the highest average gross, about $435M. A wide spread between genres is expected, because some movies earn huge amounts worldwide while others earn far less.
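A per-genre average can be dominated by a genre combination that contains only one or two movies. As a hedged sketch on toy numbers, reporting the group size next to the mean makes such cases visible; the `min_movies` threshold below is a hypothetical choice, not something from the analysis above.

```python
import pandas as pd

# Toy data: one genre has a single very high-grossing movie.
toy = pd.DataFrame({
    "Genre": ["Family, Sci-Fi", "Drama", "Drama", "Drama",
              "Action, Adventure", "Action, Adventure"],
    "Gross": [435.1, 20.0, 35.0, 50.0, 300.0, 260.0],
})

# Mean gross together with the number of movies behind each mean.
summary = (
    toy.groupby("Genre")["Gross"]
    .agg(n_movies="count", mean_gross="mean")
    .sort_values("mean_gross", ascending=False)
)

# Hypothetical threshold: keep only genres with at least this many movies.
min_movies = 2
reliable = summary[summary["n_movies"] >= min_movies]
print(summary)
print(reliable)
```

The same pattern applies to the other group-wise rankings, such as average rating by director.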
certificate_by_votes=pd.DataFrame(df.groupby('Certificate').No_of_Votes.mean().sort_values(ascending=False))
certificate_by_votes.reset_index(inplace=True)
certificate_by_votes.columns=['Certificate','No_of_Votes']
certificate_by_votes.head()
| | Certificate | No_of_Votes |
|---|---|---|
| 0 | UA | 439032.582857 |
| 1 | A | 428215.045685 |
| 2 | U | 256106.358974 |
| 3 | R | 212991.869863 |
| 4 | PG-13 | 144101.976744 |
fig=px.bar(data_frame=certificate_by_votes, x='Certificate', y='No_of_Votes', color='Certificate',
title="Certificate or Content type by Average Votes Received", text='No_of_Votes')
fig.show()
The UA certificate category has the highest average number of votes.
certificate_by_gross=pd.DataFrame(df.groupby('Certificate').Gross.mean().sort_values(ascending=False))
certificate_by_gross.reset_index(inplace=True)
certificate_by_gross.columns=['Certificate','Gross']
certificate_by_gross.head()
| | Certificate | Gross |
|---|---|---|
| 0 | UA | 122.887000 |
| 1 | U | 76.124855 |
| 2 | A | 59.297640 |
| 3 | G | 43.114083 |
| 4 | PG-13 | 34.506535 |
fig=px.bar(data_frame=certificate_by_gross, x='Certificate', y='Gross', color='Certificate',
title="Certificate or Content type by Average Gross", text='Gross')
fig.show()
director_by_rating=pd.DataFrame(df.groupby('Director').IMDB_Rating.mean().sort_values(ascending=False)[:10])
director_by_rating.reset_index(inplace=True)
director_by_rating.columns=['Director','IMDB_Rating']
director_by_rating.head()
| | Director | IMDB_Rating |
|---|---|---|
| 0 | Frank Darabont | 8.95 |
| 1 | Irvin Kershner | 8.70 |
| 2 | Lana Wachowski | 8.70 |
| 3 | George Lucas | 8.60 |
| 4 | Roberto Benigni | 8.60 |
fig=px.bar(data_frame=director_by_rating, x='IMDB_Rating', y='Director', color='Director',
title="Top 10 Directors by High Average IMDB Rating", text='IMDB_Rating')
fig.show()
Frank Darabont has the highest average IMDB rating (8.95) among the directors in this list.
director_by_gross=pd.DataFrame(df.groupby('Director').Gross.sum().sort_values(ascending=False)[:10])
director_by_gross.reset_index(inplace=True)
director_by_gross.columns=['Director','Gross']
director_by_gross.head()
| | Director | Gross |
|---|---|---|
| 0 | Steven Spielberg | 2478.135 |
| 1 | Anthony Russo | 2205.039 |
| 2 | Christopher Nolan | 1937.453 |
| 3 | James Cameron | 1748.236 |
| 4 | Peter Jackson | 1597.313 |
fig=px.bar(data_frame=director_by_gross, x='Gross', y='Director', color='Director',
title="Top 10 Directors by Total Gross Net", text='Gross')
fig.show()
Movies directed by Steven Spielberg collected about $2.478B worldwide in total.
dd=pd.DataFrame(df.groupby(['Series_Title','Director','Star1','Star2','Star3','Star4']).Gross.max().sort_values(ascending=False)[:5])
dd
| Series_Title | Director | Star1 | Star2 | Star3 | Star4 | Gross |
|---|---|---|---|---|---|---|
| Star Wars: Episode VII - The Force Awakens | J.J. Abrams | Daisy Ridley | John Boyega | Oscar Isaac | Domhnall Gleeson | 936.662 |
| Avengers: Endgame | Anthony Russo | Joe Russo | Robert Downey Jr. | Chris Evans | Mark Ruffalo | 858.373 |
| Avatar | James Cameron | Sam Worthington | Zoe Saldana | Sigourney Weaver | Michelle Rodriguez | 760.508 |
| Avengers: Infinity War | Anthony Russo | Joe Russo | Robert Downey Jr. | Chris Hemsworth | Mark Ruffalo | 678.815 |
| Titanic | James Cameron | Leonardo DiCaprio | Kate Winslet | Billy Zane | Kathy Bates | 659.325 |
The most high-rated movies were released in 2014.
The U (Unrestricted) certificate has the most top-rated movies. TV-14, 16, TV-MA and U/A have the fewest.
Most runtimes fall between 100 and 150 minutes. The longest runtime is 321 minutes, the shortest is 45 minutes, and the average is about 123 minutes.
The longest movie is Gangs of Wasseypur, a Hindi film with a runtime of 321 minutes. The shortest movie is Sherlock Jr., with a runtime of 45 minutes.
The Shawshank Redemption is the highest-rated movie, with an IMDB rating of 9.3.
The Godfather, Notorious, Boyhood, Casablanca and Fanny och Alexander are the top 5 movies by Meta score.
The Shawshank Redemption, The Dark Knight, Inception, Fight Club and Pulp Fiction received the most votes.
Movies in the Adventure, Drama, Musical genre have the highest average runtime.
Movies in the Animation, Drama, War genre have the highest average rating.
Movies in the Mystery, Romance, Thriller genre have an average Meta score of 100.
Movies in the Action, Sci-Fi genre received nearly 1.2M votes on average.
Movies in the Family, Sci-Fi genre have the highest average gross, about $435M. A wide spread between genres is expected, because some movies earn huge amounts worldwide while others earn far less.
The UA certificate category has the highest average number of votes.
Alfred Hitchcock directed a total of 14 top-rated movies. Steven Spielberg directed 13.
Frank Darabont has the highest average IMDB rating among directors, at 8.95.
Movies directed by Steven Spielberg collected about $2.478B worldwide in total.